ABSTRACT

Email is a system for sending messages from one individual to another over telecommunication links
between computers or terminals using dedicated software applications. Email is now one of the most
common and effective modes of communication at both the personal and the professional level. As the
number of email users has grown, the volume of spam email has also increased over the past few years.
This paper explores how email data were classified using three different classifiers (Naive Bayes
classifier, Support Vector Classifier, and J48 classifier) for detecting spam using WEKA. The experiment
was performed on a dataset, and the classifiers were compared on parameters such as Accuracy, Recall,
Precision, F-measure, and False Positive Rate. The final classification result is '1' if the email is
spam and '0' if it is not. The paper shows that, among the algorithms compared, the J48 classifier,
which classifies the dataset with a binary decision tree, is the best and most efficient algorithm for
detecting spam emails.


KEY WORDS: Decision tree, WEKA, dataset, classification algorithm, Support Vector Machine, Email.


I. INTRODUCTION: 

Spam is commonly defined as unsolicited bulk email received without the
recipient's permission, and the goal of spam detection is to distinguish spam
from legitimate email messages. Much spam contains viruses, Trojan horses,
and other harmful software that may cause failures in computer systems and
networks. Spam also consumes bandwidth and storage space, which slows down
email servers.


Nowadays, spam mail has increased alarmingly, prompting a need for anti-spam
filters that are reliable and accurate and that can effectively separate
legitimate mail from spam. Several text-mining and machine-learning techniques
have been used to classify spam mail, such as Naive Bayes, Support Vector
Machines, the J48 classifier, and Bayes net classification.


Spammers collect email IDs from various sources such as chats, websites,
newsgroups, malware, and users' address details, which are easily available
from other spammers at a low price. Bulk messages are then sent to recipients,
and their volume creates enormous productivity losses for IT firms and serious
security threats to systems that carry classified information.

Hence, the classification of emails is of prime importance in handling spam emails.


Machine learning algorithms, which classify objects into different classes,
have proved efficient in classifying emails as spam or ham. In this research
work, I used three main machine learning algorithms for classification, namely
the Naive Bayes classifier, the Support Vector Classifier, and the J48 classifier.


WEKA is free, open-source software that collects data-mining algorithms for
machine-learning applications. WEKA is capable of performing tasks such as
pre-processing, statistical processing, and visualization of data
(www.cs.waikato.ac.nz/ml/weka). In this paper, the Naive Bayes classifier,
the Support Vector Classifier, and the J48 classifier are applied to the
classification of spam mail. Descriptions of these algorithms and a comparison
of their performance in the WEKA environment are reported.


We give a short review of related work and a detailed description of the three
classification algorithms, and then present the experimental details followed
by the results and discussion. Finally, we present the conclusions followed by
avenues for future work.


II. RELATED WORK:

This section briefly presents previous work by researchers on the
classification of spam emails. Data-mining classification algorithms have been
deployed to separate spam from legitimate emails. One study compared four
algorithms (J48, ID3, Alternating Decision Tree, and Simple CART) for
classification accuracy. Spam datasets were run through the algorithms in a
WEKA environment, and J48 outperformed the other three.


Awad (2011) conducted a review to assess which well-known machine-learning
methods (k-NN, Bayesian classification, SVMs, Artificial Immune System, ANNs,
and Rough Sets) are suitable for classifying spam emails. The methods were
compared using the SpamAssassin spam corpus, and the Naive Bayes and Rough Set
methods showed promise. The review suggested improving the Naive Bayes and
Artificial Immune System methods through hybrid systems, either by addressing
the dependence among features in the Naive Bayes classifier or by combining
the immune system with Rough Sets. Hybrid systems showed the most promise for
spam-filtering efficiency.


Another study used the TANAGRA data-mining tool to identify efficient spam
classifiers (Kumar et al., 2012). First, feature construction and selection
were carried out on the dataset. Next, the algorithms were applied to the
dataset, cross-validation was performed, and the best algorithm was chosen
based on error rate, precision, and recall. The Rnd Tree algorithm was observed
to be superior to the others, with 99% accuracy.


In another study, a comparison was made among classification algorithms,
namely Simple Cart, ADTree, J48, Naive Bayes, ZeroR, and the Random Forest
classification algorithm, to assess their ability to select the course best
suited to students based on their choices (Aher and Lobo, 2012). WEKA was used
to describe and evaluate the results. The ADTree algorithm was observed to be
the better choice because it produced fewer wrongly classified instances than
the other algorithms.


In (Bhat, Sajid Yousuf, Muhammad Abulaish, and Abdulrahman A. Mirza, 2014),
the authors evaluated various ensemble classifiers for spammer detection in
social networks. The dataset was taken from Facebook, and spammer behavior was
injected into it by the authors. Instead of content-based features, new
features based on network structure were used to detect the spammers. Some of
the base classifiers available in WEKA, namely J48, IBK, and Naive Bayes, were
used and evaluated. The ensemble learning approaches of bagging and boosting
were applied to the base classifiers (J48, IBK, and Naive Bayes) and evaluated
on the given dataset. The bagging approach using J48 performed better than the
other classifiers evaluated.


In (Trivedi, Shrawan Kumar, and Shubhamoy Dey, 2013), the authors compared the
performance of probabilistic classifiers with and without the help of various
boosting algorithms. The data were taken from the Enron email dataset. A
Genetic Search algorithm was used to select important features, and 134
features were selected out of 1359. Naive Bayes and Bayesian classifiers were
evaluated first, and boosting algorithms were then used to enhance the
performance of the classifiers. The Bayesian classifier performed better than
Naive Bayes. Boosting with Resample using the Bayesian classifier gave the best
result among the classifiers, with an accuracy of 92.9%. AdaBoost also gave
good results. As future work, boosting algorithms can be used with other base
classifiers to compare performance.


Various machine-learning algorithms for the filtration of spam were discussed
and compared in another study (Islam and Chowdhury, 2005). This study covered
automated filtration and machine-learning methods based on rules and content,
including individualized and collaborative approaches, support vector machines,
and kernel-based algorithms for checking spam. All of these techniques were
compared, and their advantages were presented.


Pei-yu et al. (2009) suggested an improved Bayesian algorithm for classifying
spam. The k-nearest neighbor (KNN) algorithm has also been used for spam
filtering because of its accuracy and simplicity. SVM is also used for
filtering spam; it finds a hyperplane that separates legitimate mails from spam
mails and works with a smaller training set.

III. DESCRIPTION OF CLASSIFICATION ALGORITHMS:

A. Naive Bayes Classification Algorithm: This is a classification technique
based on Bayes' Theorem that assumes independence among predictors. A Naive
Bayes classifier assumes that "the presence of one particular feature in a
class is unrelated to the presence of any other feature".


A Naive Bayes model is easy to build and is useful for large datasets. Despite
its simplicity, it can outperform highly sophisticated classification methods.


Bayes' theorem provides a way to calculate the posterior probability P(c|x) from
P(c), P(x) and P(x|c), and the equation is as below:

$$P(c \mid x) = \frac{P(x \mid c)\,P(c)}{P(x)}$$

In the above equation,


* P(c|x) is the posterior probability of the class (c, target) given the predictor (x, attributes).
* P(c) is the prior probability of the class.
* P(x|c) is the likelihood, which is the probability of the predictor given the class.
* P(x) is the prior probability of the predictor.


Now suppose the event c represents spam and x is 'the message contains certain
words'. Bayesian filtering would then predict the probability that the message
is really spam, given the 'test results' (the presence of certain words).


How the Naive Bayes algorithm works: The steps of the Naive Bayes algorithm
are as follows.

Step 1: Convert the data set into a frequency table.
Step 2: Create a likelihood table by finding the probabilities.
Step 3: Use the Naive Bayes equation to calculate the posterior probability
for each class. The class with the highest posterior probability is the
outcome of the prediction.
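
The three steps can be illustrated with a minimal Python sketch. It is not the WEKA implementation used in this paper: the training messages, the function names `train` and `posterior`, and the use of add-one (Laplace) smoothing are all assumptions introduced only for illustration.

```python
from collections import Counter

# Hypothetical toy training set: (tokens in message, label), 1 = spam, 0 = ham.
TRAIN = [
    (["win", "money"], 1),
    (["win", "prize"], 1),
    (["meeting", "today"], 0),
    (["money", "meeting"], 0),
]


def train(samples):
    """Steps 1 and 2: build frequency tables, then likelihood tables."""
    class_counts = Counter(label for _, label in samples)
    token_counts = {label: Counter() for label in class_counts}
    for tokens, label in samples:
        token_counts[label].update(tokens)
    vocab = {tok for tokens, _ in samples for tok in tokens}
    priors = {c: n / len(samples) for c, n in class_counts.items()}
    likelihood = {}
    for c in class_counts:
        total = sum(token_counts[c].values())
        # Add-one smoothing is an assumption made here so that a token never
        # seen with a class does not force the whole product to zero.
        likelihood[c] = {
            tok: (token_counts[c][tok] + 1) / (total + len(vocab)) for tok in vocab
        }
    return priors, likelihood


def posterior(tokens, priors, likelihood):
    """Step 3: posterior P(c|x) for each class, normalised by P(x)."""
    scores = {}
    for c, prior in priors.items():
        score = prior
        for tok in tokens:
            if tok in likelihood[c]:  # tokens outside the vocabulary are ignored
                score *= likelihood[c][tok]
        scores[c] = score
    evidence = sum(scores.values())  # P(x), summed over both classes
    return {c: s / evidence for c, s in scores.items()}


priors, likelihood = train(TRAIN)
post = posterior(["win", "money"], priors, likelihood)
prediction = max(post, key=post.get)  # class with the highest posterior
print(post, prediction)
```

With this toy data the message "win money" receives a posterior of 0.75 for the spam class, so the predicted class is 1.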


One advantage of Naive Bayes is that it needs only a small training set to
estimate the classification parameters.


B. Support Vector Machine Classifier Algorithm: A Support Vector Machine
(SVM) is a discriminative classifier that is defined by a separating
hyperplane. In two-dimensional space, the hyperplane is a line that divides the
plane into two parts, with each class lying on one side of the line.

















How the support vector machine algorithm works: The process for classifying
whether an email is spam or ham is as follows.

1) Collect the dataset.
2) Filter the collected data.
3) Separate all the messages into tokens.
4) Build a vector for each message from its tokens and their frequencies of
occurrence:

(X1, Y1), (X2, Y2), ..., (Xn, Yn)

where Xi is a vector whose numeric values are the number of times each token
occurs in the message, and Yi is either +1 or -1, which defines the two
classes: +1 = Spam, -1 = Ham.

5) Finally, the SVM constructs a hyperplane over the plotted vector points that
separates class +1 (spam) from class -1 (ham).
6) Classify the data as spam or ham.
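
Steps 3 to 6 can be sketched in Python as follows. This is only an illustration under stated assumptions: the messages are hypothetical, and the training loop uses plain hinge-loss subgradient updates without a regularisation term, so it finds a separating hyperplane but not necessarily the maximum-margin one that a real SVM solver (such as WEKA's SMO, used in the experiments) would find.

```python
# Hypothetical toy messages with labels: +1 = spam, -1 = ham.
MESSAGES = [
    ("win money now", +1),
    ("win prize money", +1),
    ("meeting at noon", -1),
    ("lunch meeting today", -1),
]


def tokenize(text):
    """Step 3: split a message into lowercase tokens."""
    return text.lower().split()


def vectorize(text, vocab):
    """Step 4: count how many times each vocabulary token occurs."""
    tokens = tokenize(text)
    return [tokens.count(tok) for tok in vocab]


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_linear(samples, epochs=100, lr=1.0):
    """Step 5 (simplified): hinge-loss subgradient updates, no regularisation."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        updated = False
        for x, y in samples:
            if y * (dot(w, x) + b) < 1:  # point lies inside the margin
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                updated = True
        if not updated:  # every training point now has margin >= 1
            break
    return w, b


def classify(text, vocab, w, b):
    """Step 6: the side of the hyperplane decides the class."""
    return +1 if dot(w, vectorize(text, vocab)) + b >= 0 else -1


vocab = sorted({tok for text, _ in MESSAGES for tok in tokenize(text)})
samples = [(vectorize(text, vocab), y) for text, y in MESSAGES]
w, b = train_linear(samples)
print(classify("win money", vocab, w, b))      # falls on the spam side (+1)
print(classify("meeting today", vocab, w, b))  # falls on the ham side (-1)
```

The toy classes share no tokens, so the data are linearly separable and the loop stops after two passes.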











C. J48 Classifier algorithm: A decision tree algorithm is used to find out how
the attribute vector behaves for a number of instances. The algorithm generates
rules for predicting the target variable and helps in understanding the
critical distribution of the data easily.


J48 is an extension of ID3. Its additional features include accounting for
missing values, decision tree pruning, continuous attribute value ranges,
derivation of rules, etc. In the WEKA data mining tool, J48 is an open-source
Java implementation of the C4.5 algorithm. The WEKA tool provides a number of
options associated with tree pruning.


Decision trees are built by J48 from a set of training data (S = s1, s2, s3, ...)
by using the concept of information entropy. Each training sample si is already
classified and consists of a p-dimensional vector (x1i, x2i, x3i, ...), where
xj stands for a feature or attribute value of the sample, together with the
class to which si belongs.


At each node, the algorithm selects the attribute that most effectively divides
its set of samples into subsets enriched in one class or the other. The
splitting criterion is the normalized information gain (difference in entropy).
The attribute with the highest normalized information gain is selected to make
the decision. The algorithm then repeats on the smaller sub-lists.
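
The splitting criterion can be made concrete with a short Python sketch. It is a simplified illustration rather than WEKA's J48 code: it handles nominal attributes only, and the attribute names `has_free` and `has_hello` and their values are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of a list of labels or attribute values."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())


def gain_ratio(values, labels):
    """Normalised information gain of splitting `labels` by a nominal attribute."""
    total = len(labels)
    subsets = {}
    for v, y in zip(values, labels):
        subsets.setdefault(v, []).append(y)
    # Weighted entropy that remains after the split.
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)  # entropy of the partition itself
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical toy data: 1 = spam, 0 = ham.
labels = [1, 1, 0, 0]
has_free = ["yes", "yes", "no", "no"]   # separates the classes perfectly
has_hello = ["yes", "no", "yes", "no"]  # carries no class information

candidates = {"has_free": has_free, "has_hello": has_hello}
best = max(candidates, key=lambda name: gain_ratio(candidates[name], labels))
print(best)
```

Here `has_free` has a gain ratio of 1 and `has_hello` a gain ratio of 0, so the node would split on `has_free`. Continuous attributes, which C4.5 handles by choosing a threshold, are outside this sketch.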


How the J48 Classifier algorithm works:
The base cases for this algorithm are as follows.

1. All the samples in the list belong to the same class. When this happens, a
leaf node is created in the decision tree that says to choose that class.

2. None of the features provides any information gain. In this case, a decision
node is created higher up the tree using the expected value of the class.

3. An instance of a previously unseen class is encountered. Again, a decision
node is created higher up the tree using the expected value of the class.


IV. EXPERIMENTAL SETUP:

This section reports the experimental evaluation of the three algorithms.
First, the dataset of spam email used for evaluation is described in detail.
Then the formulas for the performance measures are outlined. Finally, the
experimental results and the related discussion are presented in the form of
bar charts and tables.


a. WEKA: The University of Waikato in New Zealand developed WEKA
(Waikato Environment for Knowledge Analysis), an innovative tool used in the
data mining and machine learning research communities. The tool has been
developed by the WEKA team since 1994 and contains many built-in algorithms
for data mining and machine learning. It is an open-source, user-friendly,
freely available, and platform-independent machine learning system. Users who
are not familiar with data mining can also use the software easily, and it
provides flexible facilities for scripting and evaluating experiments. As new
algorithms appear in the research literature, they are added to the software
and made available to users.


b. Steps: Data mining in WEKA involves the following steps:
* Data pre-processing and visualization
* Attribute selection
* Classification (Decision trees)
* Prediction (Nearest neighbour)
* Model evaluation
* Clustering (Cobweb, K-means)
* Association rules


The algorithms use WEKA as an API in MATLAB. WEKA is a comprehensive
open-source machine learning toolkit written in Java. The basic functions
provide a MATLAB interface to WEKA that allows data to be transferred back and
forth in order to use features available in WEKA, such as training classifiers.

Each algorithm is evaluated by loading the ARFF data file from WEKA into
MATLAB. The dataset is then refined by applying the classifier. Finally, the
results are obtained, which show the accuracy, error rate, etc. The flow chart
below shows the algorithm evaluation process.





[Flow chart of the evaluation process. Recoverable box labels: "Load ARFF from WEKA", "MATLAB", "Refine Loaded Data Set using ..."]


c. Experimental Datasets: This study uses a spam email dataset that is
publicly available from the UCI Machine Learning Repository, i.e., the SPAM
E-mail Database. It contains 4601 emails described by 57 attributes, with 1813
emails being spam and the rest (2788) being normal emails. The dataset is
multivariate, with real and integer attribute values.
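
As an illustration of the ARFF file format in which WEKA datasets are stored, the following Python sketch parses a tiny ARFF text and counts the class labels. The fragment, its two attribute names, and the helper `parse_arff` are hypothetical and handle only simple, dense ARFF files; the experiments in this paper loaded the data through WEKA and MATLAB instead.

```python
from collections import Counter

# A tiny hypothetical ARFF fragment; the real dataset has 57 attributes.
ARFF_TEXT = """\
@relation spam_demo
@attribute word_freq_free numeric
@attribute char_freq_dollar numeric
@attribute class {0,1}
@data
0.32,0.18,1
0.00,0.00,0
% a comment line
1.10,0.45,1
0.00,0.02,0
"""


def parse_arff(text):
    """Return (attribute names, data rows) from simple, dense ARFF text."""
    names, rows, in_data = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):  # skip blank and comment lines
            continue
        lower = line.lower()
        if in_data:
            rows.append(line.split(","))
        elif lower.startswith("@attribute"):
            names.append(line.split()[1])
        elif lower.startswith("@data"):
            in_data = True
    return names, rows


names, rows = parse_arff(ARFF_TEXT)
class_counts = Counter(row[-1] for row in rows)  # last column holds the class
print(names, class_counts)
```

Run on the full dataset, the same count would be expected to reproduce the class distribution stated above (1813 spam, 2788 normal).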


d. Performance Evaluation: This section shows the comparison of the different
data mining algorithms. In machine learning, particularly in statistical
classification, a table is prepared that allows the performance of an algorithm
to be visualized. This is the confusion matrix, or error matrix. Its columns
represent the predicted class and its rows represent the actual class
(Table 1). It is so called because it makes it simple to check whether the
system is confusing two classes (e.g., mislabelling one as the other).





Table 1. Confusion Matrix

                             Predicted: Belongs     Predicted: Does not belong
Actual: Belongs              True Positive (TP)     False Negative (FN)
Actual: Does not belong      False Positive (FP)    True Negative (TN)

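We then define the following parameters:

$$\text{Accuracy} = \frac{TP + TN}{TN + TP + FN + FP}$$

$$\text{Recall } (R) = \frac{TP}{TP + FN}$$

$$\text{Precision } (P) = \frac{TP}{TP + FP}$$

$$\text{False Positive Rate} = \frac{FP}{FP + TN}$$

$$F\text{-measure} = \frac{2 \times P \times R}{P + R}$$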

The performance of the algorithms was analysed on the basis of a comparison of
the above parameters.
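
The parameters defined above can be computed from a list of actual and predicted labels as in the Python sketch below. The label lists and the function names `confusion_counts` and `metrics` are hypothetical and serve only to illustrate the formulas; they are not the paper's data.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FN, FP and TN for the class treated as positive."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1
        elif a == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def metrics(tp, fn, fp, tn):
    """Accuracy, recall, precision, false positive rate and F-measure."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "recall": recall,
        "precision": precision,
        "fpr": fp / (fp + tn),
        "f_measure": 2 * precision * recall / (precision + recall),
    }


# Hypothetical labels: 1 = spam, 0 = not spam.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tp, fn, fp, tn = confusion_counts(actual, predicted)
result = metrics(tp, fn, fp, tn)
print(result)
```

For these labels TP = 3, FN = 1, FP = 1 and TN = 5, giving an accuracy of 0.8 and a recall, precision and F-measure of 0.75 each. The sketch does not guard against division by zero, which would occur if a class never appeared in the actual or predicted labels.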


V. RESULTS AND DISCUSSION:
The spam email dataset was used as input to the algorithms, which were run in
the WEKA environment. The final results are shown below.

The results are divided into two categories in the class column: Spam as 1 and
No Spam (legitimate) as 0.





Table 2: A comparison of results between three algorithms

Classification   Accuracy (%)             Recall       Precision    FP Rate      TP Rate     F-Measure   Class
algorithm        (TP+TN)/(TP+TN+FP+FN)    TP/(TP+FN)   TP/(TP+FP)   FP/(FP+TN)   (=Recall)
NAIVE BAYES      79.56                    0.951        0.666        0.310        0.951       0.784       1
                                          0.690        0.956        0.049        0.690       0.801       0
J48              92.68                    0.908        0.913        0.056        0.908       0.911       1
                                          0.944        0.940        0.092        0.944       0.942       0
SVM              90.48                    0.831        0.918        0.048        0.831       0.873       1
                                          0.952        0.897        0.169        0.952       0.923       0

Confusion Matrix

Classification     a       b       Classified as
Algorithm
NAIVE BAYES        1725    88      a = 1
                   865     1923    b = 0
J48                1646    167     a = 1
                   156     2632    b = 0
Functions SMO      1507    306     a = 1
                   134     2654    b = 0

[Figures: bar charts "Comparison of Recall" and "Comparison of F-Measure" for the SPAM and NO SPAM classes across Naive Bayes, Trees J48, and Functions SMO]
Features                     Ranking                                      Categories
                             1              2              3
Accuracy                     J48            SVM            NAIVE BAYES
Precision                    SVM            J48            NAIVE BAYES    Spam
                             NAIVE BAYES    J48            SVM            No Spam
Recall / TPR                 NAIVE BAYES    J48            SVM            Spam
                             SVM            J48            NAIVE BAYES    No Spam
F-Measure                    J48            SVM            NAIVE BAYES    Spam
                             J48            SVM            NAIVE BAYES    No Spam
False Positive Rate (FPR)    NAIVE BAYES    J48            SVM            Spam
                             SVM            J48            NAIVE BAYES    No Spam

The above table indicates that J48 is the best algorithm in terms of accuracy
and that it also performs well in Recall, Precision, FPR and F-measure.


J48 creates decision trees from labelled data and uses them to make decisions
by dividing the data into smaller subsets. Splitting is based on the normalized
information gain (entropy difference), and the attribute with the maximum
normalized gain is chosen to make the decision.


J48 decision trees can handle missing values, continuous value ranges, etc.,
which makes J48 a superior algorithm compared with the others. In this study,
it was observed that no algorithm achieved 100% accuracy in detecting spam in
email classification.


VI. CONCLUSION AND FUTURE WORK:

This paper presents a method to classify mails based on three classifiers, i.e.,
J48, SVM, and Naive Bayes. These classifiers were evaluated on separating spam
from the email dataset by using WEKA. The emails were identified as spam (1)
or not spam (0), based on the attributes of the email dataset for spam
filtering.


The algorithms were checked against parameters such as Accuracy, Precision,
Recall, F-measure and False Positive Rate. The analysis of the results
demonstrated clearly that, even though J48 is a very simple classifier that
uses a decision tree, it gave the most accurate result in the experiment
(92.68%). In addition, it performed well on the other parameters, with highly
favorable values, coming first in the F-measure for both the spam and no-spam
categories and holding second position for the other parameters. Thus, J48 is
preferable to the other algorithms compared in this study for classifying
e-mails with the purpose of filtering spam.


The SVM classifier also showed good results, with an accuracy of 90.48%, and
performed well on the other parameters too. Naive Bayes, however, gave an
accuracy of 79.56%, which is a poor result in comparison with the other
classifiers.


In further research, an in-depth analysis of algorithms such as the Genetic
algorithm and of other classification techniques for finding spam is required.
In addition, algorithms that are not included in WEKA should be tested, and
experiments with various attribute selection features should be carried out for
comparison.